Automatic Dynamic Contextual Data Entry Completion

ABSTRACT

A method performed in a computer device having associated therewith a plurality of unstructured documents having words therein, the method involves accessing at least some of the plurality of unstructured documents, extracting a multiset of words, forming a matrix from the documents in which each word in the multiset is represented in a column and each document from which the words came is represented in a row, treating each document as a vector in a multidimensional Euclidean space, uniquely pairing the unique documents, measuring the similarity between the pairs as a cosine of the angle between vectors, comparing the cosines to a specified threshold to determine relatedness among the documents, and based upon the relatedness, when an input is received by the computer device representing a string of a threshold number of characters, the computer device will provide at least one word that would complete the character string.

CROSS REFERENCE TO RELATED APPLICATIONS

This application is a continuation of U.S. patent application Ser. No. 11/422,939 filed Jun. 8, 2006, which is a continuation-in-part of U.S. patent application Ser. No. 11/040,470 filed Jan. 21, 2005, the entireties of which are incorporated herein by reference as if fully set forth herein.

FIELD OF THE INVENTION

The present invention relates to information processing and, more particularly, computer, cell phone, personal digital assistant, or other similar device-based text entry.

BACKGROUND OF THE INVENTION

In modern life, there are a number of devices, notably digital computers and multifunctional handheld units, that involve data entry, typically of text, including for example cellular phones and other devices like organizers and handheld computers. For all of these, one important use is the entry of linguistic items like words, phrases, and sentences. For example, a user may create an unstructured text document or might formulate an email message or a short text message to be sent as an SMS message on a cell phone. In such cases, text entry may occur through use of a keyboard or stylus for some handheld computers or cell phones, etc. However, data entry can be difficult when the keyboard is relatively small, as it is on a handheld cell phone, organizer or computer, or uses individual keys for entry of multiple letters, especially when a large number of characters must be entered. Similarly, with devices employing a stylus for text entry, entry of text can be slow and burdensome.

Automated word completion programs have eased the burden somewhat. Such automated word completion programs have appeared recently in a variety of applications in a variety of devices. These programs are typically based on either predefined word suggestion lists (e.g. a dictionary) or are culled from the user's own most recently typed terms, the latter often called MRU (i.e. “Most Recently Used”) programs. For example, the former type of program is based on a pre-given word suggestion list based on a dictionary augmented with information about which words are more frequently used. If a user types the characters “su” in a document, then it might suggest “super” as the appropriate word completion based on the fact that it belongs to the pre-given word suggestion list and has a high frequency of use in general English. On the other hand, the latter type of program would suggest a word completion based on the user's own recently used words (e.g. “supreme” may be suggested to a lawyer who has recently input “Supreme Court”). Such programs are often found in web browsers for example and will suggest the most recently used “uniform resource locator” or URL (e.g. www.google.com when the user types “www.g”) as characters are input.

A third type of program is able to detect that the user is in a particular type of field (e.g. the closing of a letter) and will suggest word completions (e.g. “Sincerely” when the user types “Si”) based on a more limited “contextual” list. An extension of this is to maintain many separate word suggestion lists and allow the user to choose an appropriate list for each document the user creates. Other variants allow users to actually insert entries manually into word suggestion lists (e.g. a name and address) or to maintain frequencies of word usage by a user and thus, rather than offering the most recently used word, offer the user's most frequently used words.

SUMMARY OF THE INVENTION

While the methods delineated above have many useful features, there is still a lack of a true context based system that is dynamic and automatic and thus, there is still much room for improvement when it comes to data entry in such devices. Systems that maintain separate word lists and allow the user to choose an appropriate list are contextual to some degree, but still have the drawback of requiring the user to make a list selection each time, something that can become annoying for a user who typically creates several documents within the course of a single day. Moreover, separate word suggestion lists are still inefficient because they are not automatically generated but instead depend on the user's guidance and input.

The present invention combines certain features from existing techniques but goes significantly beyond them in creating a family of techniques that are automatic, dynamic, and context based as explained in greater detail herein.

One aspect involves a method performed in a computer device having associated therewith a plurality of unstructured documents having words therein, the unstructured documents being stored in non-transient memory accessible to the computer device. The method involves accessing at least some of the plurality of unstructured documents, extracting from each document making up the at least some of the plurality of unstructured documents, a multiset of words, forming a matrix from the documents in which each word in the multiset is represented in a column and each document from which the words came is represented in a row, treating each document as a vector in a multidimensional Euclidean space with the number of dimensions being equal to one of the number of columns or number of rows in the matrix, uniquely pairing the unique documents making up the at least some of the plurality of unstructured documents, measuring the similarity between the pairs of unique documents making up the at least some of the plurality of unstructured documents as a cosine of the angle between vectors, comparing the cosines to a specified threshold to determine relatedness among the documents, and based upon the relatedness, when an input is received by the computer device representing a string of a threshold number of characters, the computer device will provide at least one word that would complete the character string.

The advantages and features described herein are a few of the many advantages and features available from representative embodiments and are presented only to assist in understanding the invention. It should be understood that they are not to be considered limitations on the invention as defined by the claims, or limitations on equivalents to the claims. For instance, some of these advantages or features are mutually exclusive or contradictory, in that they cannot be simultaneously present in a single embodiment. Similarly, some advantages are applicable to one aspect of the invention, and inapplicable to others. Thus, the elaborated features and advantages should not be considered dispositive in determining equivalence. Additional features and advantages of the invention will become apparent in the following description, from the drawings, and from the claims.

BRIEF DESCRIPTION OF THE FIGURES

FIG. 1 illustrates, in simplified form, a top-level flowchart for the automatic completion of character input using contextual word associations;

FIG. 2 a illustrates a simplified flowchart for computing contextual associations in one example implementation of the invention;

FIG. 2 b illustrates a simplified flowchart for the selection of contextual associations in an example implementation of the invention;

FIG. 3 illustrates an example documents versus words matrix used to compute contextual associations with an example implementation of the invention;

FIG. 4 a illustrates a simplified flowchart for computing contextual associations in an alternative example implementation of the invention;

FIG. 4 b illustrates a simplified flowchart for the selection of contextual associations in an alternative example implementation of the invention;

FIG. 5 illustrates an example matrix of pairs of words used to compute contextual associations in the alternative example implementation of the invention; and

FIG. 6 illustrates an example set of word lists for a word completion example involving the alternative example implementation.

DETAILED DESCRIPTION OF THE INVENTION

The present invention can be used with a variety of electronic devices. The minimum requirements for any such device are some means for accepting textual input from a user, one or more processor(s) that execute stored program instructions to process the input, storage for the data and the program instructions and a display or other output device of some sort to make output visible or available to the user. Representative, non-exhaustive, example input devices can include, but are not limited to, a keyboard, a handwriting recognition system that makes use of a stylus, a touch pad, a telephone keypad, a pointing device like a mouse, joystick, trackball or multi-directional pivoting switch or other analogous or related input devices. The storage preferably includes non-volatile memory, and can also include volatile semiconductor-based memory, electro-magnetic media, optical media or other types of rewriteable storage used with computer devices. If a display is used, the display may be small and capable of displaying only text or much larger and capable of displaying monochrome or color images in addition to text. If another output device is used, like a text to speech converter, appropriate implementing equipment will be included. Although described, for purposes of clarity, with reference to keyboard-type entry, it is to be understood that the present invention is independent of the particular mode of, or device used for, text data entry.

At the outset, it should be noted that, for the purposes of this invention, a “document” as used herein is intended to be a very general term covering one or more characters, whether alone or in conjunction with numerals, pictures or other items. A document's length can vary from a single “word” to any number of words and it can contain many types of data other than words (e.g. numbers, images, sounds etc.). Thus, ordinary documents such as pages of text are documents, but so are spreadsheets, image files, sound files, emails, SMS text messages etc.

As noted above, a “word,” for the purposes of this invention, can be considered to be more than a string of alphabetic characters; it may include numeric and other symbols as well. Broadly, the invention provides contextual completion of character strings, where a character string includes not only alphabetic words but any other discrete collection of characters, symbols, or stroke based pictographs or ideograms, for example, those used in languages like Chinese, Korean and Japanese, which can thus also benefit from use of the present invention. Thus, although for simplicity the term “word” is used in the following discussion, it should be understood to encompass any discrete collection of characters, symbols or other stroke based representations of communicative concepts, thoughts or ideas. Thus, the present invention, although described with reference to English, is independent of any particular language. It can be used for phonetic, pictographic or ideographic languages when the characters, pictograms or ideograms used therein (or “stroke” components thereof) are considered “words” and thereby are intended to be encompassed by the terms “text” and “textual.” In some cases, an entire pictogram or ideogram will be usable as a “word” as described herein with entry of a component of the pictogram or ideogram, such as a defined stroke, being analogous to entry of a letter in English. Likewise, for simplicity in the following examples, the terms “typing” or “typed” are used to describe data entry. However, those terms should be broadly read to encompass any and all methods of data entry, whether involving entry through use of a keyboard, a pointing or selection device, a stylus or other handwriting recognition system, etc. They are not in any way intended to be limited only to methods that make use of a typewriter-like keyboard.

Examples of devices that can use and benefit from incorporation of the invention therein range from large computer networks, where an implementation of the invention may be part of or an application on the network, to small portable hand held devices of more limited or specialized function such as cell phones, text messaging devices and pagers. Implementations incorporating the invention can be used to assist users in interacting with large databases by helping in the entry of search terms or in data entry. Other implementations incorporating the invention are particularly useful for portable devices, in which the input device is limited by size and difficult to work with, because the automatic completion of character string entries provides greater benefits in such devices. Still other implementations incorporating the invention are particularly useful for devices used by those with physical handicaps. In addition to the methods of character input already mentioned, devices intended for use by handicapped individuals may rely on some type of pointing device to select individual characters for input. The pointing device may be controlled by movement of the eyes, head, hands, feet or other body part depending on the abilities of the particular individual. The present invention may also be used with “text” that is implemented in braille or other tactile representations for individuals with impaired vision.

In overview, in connection with the invention, words from one or more documents are associated, in either a fully or partially automated way, based on context. Context is derived from the co-occurrence of individual words in documents. In addition, the associations can be pre-computed and static or dynamic so they can thereby evolve and improve with continued use.

For example, in an implementation of the invention, an association between “finance” and “summary” may be generated but not one between “finance” and “sugar;” in this case, if a user has typed in the word “finance” followed by the characters “su,” then, based on the association, the invention will suggest “summary” as the appropriate word completion rather than “sugar.” Here, the word “finance” has provided the context that suggests the appropriate completion; if instead the user had typed “two spoons of” and then the characters “su,” and if an association had been generated between “spoon” and “sugar” rather than “spoon” and “summary,” then the invention would suggest “sugar” as the contextually appropriate completion. As more words are entered in the document, the contextual associations become richer.

The invention permits the use of different techniques for actually creating the associations. As a result, for purposes of understanding, two fully automated example techniques are described below with the understanding that semi-automatic implementation techniques are considered to be literally the same as the fully automated ones. The automatic or manual nature of a technique is, in most respects, independent of the invention because it relates more to the ease of processing large amounts of text, not the technique itself.

The general approach is illustrated, in simplified overview, in FIG. 1 with respect to a single document. The approach begins with a device such as a personal digital assistant, cell phone, computer or other device (100, 110, 120 or 130) which has documents (140) stored in its memory. These documents are used to create associations (150) between pairs of words or character strings within the document and use these associations to suggest word or character string completions (170) to the user entering text (160) in a document. The associations among the words or strings may be static or dynamic. With implementations incorporating a more dynamic approach, as the user adds to a document or creates more documents on the device, the associations are recomputed or suitably augmented. This will alter the set of associations by either adding new associations, deleting existing associations or both. Thus, with implementations of the automatic contextual word completion system having this “dynamic” aspect, the system evolves as the user adds to or creates new documents and thus generally improves with use. Extensions to these implementations further allow the device to impliedly track the user's evolving interests.

Associations between words can be computed in a variety of ways and, as non-limiting examples, two alternative automatic methods of doing so are described.

In the first method, the first step is to assess the similarity of words within one document or from one document to other documents that may exist on the user's device. In this method, contextual associations are arrived at by grouping documents based on similarity and creating lists of words that are common to each group. There are many known methods to assess document similarity including the Jaccard, Dice or cosine coefficients and the K-vec methods. For purposes of explanation, one such example similarity assessment method, based on treating documents as vectors in a multidimensional space, is used, it being understood that, depending on the particular implementation, other similarity assessment methods can be used in addition to, or instead of, those used in the examples described herein for practical reasons.

This example method is outlined in the flowcharts in FIGS. 2 a and 2 b. The method starts by creating a list of all the pertinent documents (200) on the device. From this list of pertinent documents a list of unique words is created (205). An optional step is to remove stop words from the word list (210). Stop words are described in greater detail below but include words like “the,” “at” and “in.” For each word in the word list, the number of times it occurs in each document is counted (215) and this number is stored in a matrix of documents vs. words (220). This matrix is used to calculate a similarity value (225) for each possible pair of documents in the document list. The similarity value for each document pair is compared to a threshold value (230) and those document pairs whose similarity value falls below the specified threshold value are discarded (235). The remaining document pairs are used to group documents such that the similarity value of each possible pair in each group is above a specified threshold value (240). Lists of unique words from each group of similar documents are created (245). Words within each of these lists are contextually related. The steps of the example method to this point may be carried out independently of user text entry or, in implementations where the dynamic aspects of the invention are utilized, carried out simultaneously with user text entry, so that the contextual associations are updated as the user enters more words into the device.

Once at least an initial set of contextual associations exists, it can be used at some point thereafter. The approach to use is as follows. The device accepts character input from the user until a specified threshold number of characters has been entered (250). Using the entered characters, relevant word lists are identified (255). Due to the processing, the words within these identified lists are deemed contextually related and thus, words in the identified lists having a corresponding initial character string matching the entered characters are chosen (260) to be offered for selection by the user to complete the character entry (265).

The above referenced process can be fully understood by way of the following simplified example. To assess the similarity or dissimilarity of documents, one way of thinking of a document that contains one or more words is as a bag or multiset of words. A bag or multiset is like an ordinary set in mathematics, a collection, except that it can contain multiple occurrences of the same element. For example, {book, cape, pencil, book} is a bag containing four words of which the word “book” appears twice. The order of occurrence of elements in a bag does not matter, so this bag could equally be written as {book, book, pencil, cape}. Also, any bag can be converted to a set just by dropping multiple occurrences of the same element. Thus, the example bag above, when converted to a set, would be {book, cape, pencil}. To create the bag or multiset, the contents of a document, with the exception of numbers, which are a special case, are stripped of all internal structure (e.g. syntactic structure, punctuation etc.) including all non-lexical items like images, sounds etc. The resulting stripped document would be a bag or multiset of words as described above which may also include numbers and in which some words may occur multiple times. For a user who has a device with a number of stored documents, each pertinent document is similarly stripped down to form bags and the mathematical union of these bags can be taken to form a larger bag.
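By way of non-limiting illustration, the bag (multiset) idea described above can be sketched with Python's standard `collections.Counter`. The helper name `to_bag` and the tokenizing pattern are assumptions made for this sketch only; the specification does not prescribe a particular tokenizer.

```python
import re
from collections import Counter


def to_bag(text: str) -> Counter:
    """Strip punctuation and structure, keeping a bag of lowercase tokens.

    The regular expression is an illustrative assumption: it keeps runs of
    letters and digits and discards everything else.
    """
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


# The bag {book, cape, pencil, book}: order is irrelevant, counts are kept.
bag = to_bag("Book, cape; pencil. Book!")

# Converting the bag to a set drops the repeated occurrences.
as_set = set(bag)

# Combining the bags of two documents adds their counts together.
combined = to_bag("book cape") + to_bag("book pencil")
```

Here `bag["book"]` is 2, `as_set` holds the three distinct words, and `combined` carries the summed counts of both documents.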

As a side note, optionally, a certain class of words, typically called “stop words,” are removed from such document-derived bags. Stop words are words like “the,” “of,” “in” etc. and are removable because they usually are not very informative about the content of the document. Stop words, if removed, can be removed from the bags either before or after a mathematical union of the bags is made, as the end result is the same. Typically stop words are identified in a list which can be used for the exclusion process. Since the stop word removal process is well known it is not described herein. In addition, in some implementations where a stop word list is used, the list may be editable so that additional words can be defined as “stop words.” For example, otherwise non-trivial words that are trivial in the particular context because they occur too often in that context (e.g. words like “shares” in stock related government filings).

By way of simplified example (FIG. 3), if the user has just two documents on a device: “d1” (306) made up of “an apple, apple cider and an orange” and “d2” (308) made up of “a paper apple” then, each corresponding bag is {apple, cider, apple, orange} and {paper, apple}. Their union is the larger bag {apple, cider, apple, orange, paper, apple} and a set for the bag would be {apple, cider, orange, paper}.
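The two-document example can be reproduced in a few lines. This is a minimal sketch, assuming a small hypothetical stop word list (`STOP_WORDS`) and whitespace tokenization; it also checks the earlier observation that stop words may be removed before or after the bags are combined with the same end result.

```python
from collections import Counter

# Hypothetical stop word list; its contents are left to the implementation.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "at"}


def remove_stop_words(bag: Counter) -> Counter:
    """Return a copy of the bag without any stop words."""
    return Counter({w: n for w, n in bag.items() if w not in STOP_WORDS})


raw_d1 = Counter("an apple apple cider and an orange".split())
raw_d2 = Counter("a paper apple".split())

d1 = remove_stop_words(raw_d1)   # {apple, cider, apple, orange}
d2 = remove_stop_words(raw_d2)   # {paper, apple}

union_bag = d1 + d2              # "apple" now occurs three times
word_set = set(union_bag)        # {apple, cider, orange, paper}

# Removing stop words after combining the raw bags gives the same bag.
stripped_after_union = remove_stop_words(raw_d1 + raw_d2)
```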

A matrix (300) is then formed with, for example, each element in the set of words derived from the documents on the user's device listed along the columns (302) of the matrix and each document itself (symbolized in some way) along the rows (304) of the matrix. In the cell corresponding to the intersection of a document “d” with a word “w,” the number of times “w” occurs in “d” is entered (318). For the simple example above, as shown in FIG. 3, for the cell corresponding to the intersection of the row for the first document “d1” and the column for the word “apple” a “2” (318) is entered since it occurs twice in document “d1.” This occurrence frequency information is obtained from the document bags. If a word does not occur in a particular document at all, a zero is entered in the corresponding cell. Note that depending upon the number of documents and the number of words, the size of the matrix can be exceedingly large. Moreover, there is no significance to whether rows list documents and columns list words or vice versa; the contents of the rows and columns could be exchanged without affecting the invention.
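As a hedged sketch of the documents versus words matrix, the following builds one row per document and one column per word from the document bags. The function name `build_matrix` and the alphabetical column order are illustrative choices, not requirements of the method.

```python
from collections import Counter


def build_matrix(bags):
    """Return (vocabulary, rows): one column per word, one row per document."""
    # Alphabetical column order is an arbitrary choice for this sketch.
    vocabulary = sorted(set().union(*bags.values()))
    # A Counter returns 0 for a word the document does not contain.
    rows = {name: [bag[w] for w in vocabulary] for name, bag in bags.items()}
    return vocabulary, rows


bags = {
    "d1": Counter(apple=2, cider=1, orange=1),
    "d2": Counter(paper=1, apple=1),
}
vocabulary, matrix = build_matrix(bags)
# vocabulary   -> ['apple', 'cider', 'orange', 'paper']
# matrix['d1'] -> [2, 1, 1, 0]
# matrix['d2'] -> [1, 0, 0, 1]
```

For large collections most cells are zero, so a practical implementation would likely keep the bags themselves (a sparse form) rather than a dense table.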

Once the matrix is created, each document is treated as a vector in a multidimensional Euclidean space, with the number of dimensions being the number of words or columns of the matrix. Thus, in the simplified example of FIG. 3, each of documents d1 and d2 can be treated as a four dimensional vector since there are four elements in the corresponding set {apple, cider, orange, paper}. Notably, by using this approach, the words can also be listed in any order in the columns, for example, in order of first or last occurrence, randomly, alphabetically, by number of letters, stroke type, stroke count, etc., in forward/increasing or reverse/decreasing order.

The similarity between d1 and d2 is then estimated by measuring the cosine of the angle between vectors d1 and d2. The higher the cosine value, the greater the similarity; the lower the cosine, the greater the dissimilarity. If the value is the maximum possible value of “1,” the vectors are fully aligned in multidimensional space and the two documents are treated as identical. On the other hand, if the cosine reaches its minimum of “0” (word counts are never negative, so the cosine cannot fall below zero), the vectors or documents are completely dissimilar (i.e. they are orthogonal and share no words). For the intermediate cosine value of “0.5,” the documents are roughly as similar as they are dissimilar.
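The cosine measure can be sketched as follows for the two example vectors, using the column order {apple, cider, orange, paper}. The function name `cosine` and the convention of returning 0.0 for an empty document are assumptions of this sketch.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two word-count vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    # Assumed convention: a document with no words is similar to nothing.
    return dot / norm if norm else 0.0


d1 = [2, 1, 1, 0]   # apple x2, cider, orange
d2 = [1, 0, 0, 1]   # apple, paper

similarity = cosine(d1, d2)   # 2 / (sqrt(6) * sqrt(2)), about 0.577
```

Note that the cosine depends only on direction, not length: a document and a copy of it with every word repeated twice score 1, because their relative word frequencies are the same.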

This method can thus be extended where more than two documents exist, by pairing the documents in all unique combinations and performing the same similarity analysis on all pairs of documents that are pertinent (i.e. under consideration) from the device of the user. Note that it is likely that every “document” on the user's device may not be “pertinent” for purposes of this invention, for example, it will likely be desirable to differentiate between the user's documents and documents that may be present for other reasons (e.g. programs, system configuration file contents, system data, etc.) and thus irrelevant. For purposes of understanding the invention, it should be assumed that non-pertinent documents have been excluded in some manner, whether manually or automatically.

As to all of the pairs, cosine values are computed and a threshold value is chosen. The threshold value represents a divider in which all cosine values above the threshold are accepted as indicating meaningfully similar document pairs and below which the document pairs are sufficiently dissimilar so as to be considered not meaningfully related. In other words, a range of values is taken and on the basis of those values, an arbitrary division is formed to differentiate, and deem each document, meaningfully similar or dissimilar relative to each other document.
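The pairing and thresholding step might look like the sketch below. The third document `d3`, the threshold of 0.7, and the choice to keep pairs whose cosine exactly equals the threshold are all assumptions made for illustration.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine of the angle between two word-count vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similar_pairs(vectors, threshold):
    """Keep each unique document pair whose cosine meets the threshold."""
    kept = set()
    for a, b in combinations(sorted(vectors), 2):   # every unique pair once
        if cosine(vectors[a], vectors[b]) >= threshold:
            kept.add((a, b))
    return kept


vectors = {
    "d1": [2, 1, 1, 0],
    "d2": [1, 0, 0, 1],
    "d3": [2, 1, 0, 0],   # hypothetical third document: apple x2, cider
}
pairs = similar_pairs(vectors, threshold=0.7)
# Only (d1, d3) survives: its cosine is about 0.91, while
# (d1, d2) is about 0.58 and (d2, d3) is about 0.63.
```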

In application, the threshold value mentioned above is initially a preset parameter, the value of which may be different for different types or classes of target users. However, in some implementations, the threshold can be adjusted, whether over time through use or because of changes in the quantity or contents of the pertinent documents, and either by program modification or through user action. In other implementations, it can be adjusted automatically or by implication as is discussed in greater detail below.

In all cases, what is obtained, either through the vector space method described above or through other analysis approaches, such as Jaccard, Dice or cosine coefficients, the K-vec methods or some other method, is a judgment of similarity and dissimilarity of document pairs in the pertinent document collection or set.

This similarity judgment is then used to form groups of documents, each group of which contains only documents that are sufficiently similar to one another when compared in a pair wise fashion. Note that the relationship of similarity is reflexive and symmetric, but it is not necessarily transitive. This means that the groups may not be disjoint i.e. the same document may belong to more than one group, particularly in implementations where a document need not be sufficiently similar to every other member of the group, but only some specified portion thereof. In other words, as a result of the grouping, two or more groups will be formed wherein each document is meaningfully similar to at least some specified portion of the other documents in that group. In general, each group will have the following characteristics: each document in the group will be sufficiently similar (i.e. be above the threshold) with respect to at least some specified portion of the group (in practice, likely preset at well more than 50% if stop words are not removed), each group will be differentiable from any other group by at least one member, and each group will not be a simple subset of any other group.
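One way to realize the grouping under the strictest reading (every document in a group is similar to every other member, and no group is a subset of another) is to enumerate the maximal cliques of the graph whose edges are the retained document pairs. The sketch below uses the basic Bron-Kerbosch procedure for that purpose; this is a technique substituted here for illustration, not one named in the specification, and the function name `maximal_groups` is hypothetical.

```python
def maximal_groups(documents, similar_pairs):
    """Return groups in which every pair of members is a similar pair."""
    neighbors = {d: set() for d in documents}
    for a, b in similar_pairs:
        neighbors[a].add(b)
        neighbors[b].add(a)

    groups = []

    def expand(group, candidates, excluded):
        # A group is maximal when nothing can extend it.
        if not candidates and not excluded:
            groups.append(frozenset(group))
            return
        for doc in list(candidates):
            expand(group | {doc},
                   candidates & neighbors[doc],
                   excluded & neighbors[doc])
            candidates = candidates - {doc}
            excluded = excluded | {doc}

    expand(set(), set(documents), set())
    return groups


docs = ["a", "b", "c", "d", "e"]
pairs = {("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")}
groups = maximal_groups(docs, pairs)
# Three groups result: {a, b, c}, {c, d} and {e}.
```

Document "c" lands in two groups, which illustrates that the groups need not be disjoint, and document "e", which is similar to nothing, forms a group of its own.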

In addition, some implementations may involve large numbers of nearly identical documents such that the common information is not meaningful because it is ubiquitous and the differentiating information may be too unique relative to each other to be of interest. One such example would be certain filled out forms. In such a case, a second very high threshold can be used as a “screen” whereby copies of documents exceeding this threshold are separately grouped, the mathematical intersection of the documents is used to form a “stop word”-like list, the intersection-based stop word list is then used to strip the copies of those words, and then the copies are used in place of the actual documents in the overall grouping process. In this manner, the ubiquity can be handled without sacrificing the potential contextual relevance of the unique information to other pertinent documents.
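The screening idea for near-identical documents can be sketched as below. It assumes the documents exceeding the second, very high threshold have already been identified and copied; the function name `screen_near_duplicates` and the sample form contents are hypothetical.

```python
from collections import Counter


def screen_near_duplicates(copies):
    """Strip the words shared by every near-identical copy.

    `copies` maps a document name to its bag of words. Returns the
    intersection-based stop-word-like list and the stripped copies.
    """
    common = set.intersection(*(set(bag) for bag in copies.values()))
    stripped = {
        name: Counter({w: n for w, n in bag.items() if w not in common})
        for name, bag in copies.items()
    }
    return common, stripped


forms = {
    "form1": Counter("name date signature alice boston".split()),
    "form2": Counter("name date signature bob denver".split()),
}
common, stripped = screen_near_duplicates(forms)
# common            -> {'name', 'date', 'signature'}
# stripped['form1'] -> only 'alice' and 'boston' remain
```

The stripped copies, rather than the original forms, would then take part in the overall grouping process.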

Once groups of similar documents have been made, lists are formed for each group containing all non-stop words in the union of the sets corresponding to each document in the group. As with the lists above, these lists may be arranged in any of a variety of possible orders (e.g. alphabetical, first/last occurrence, etc.). In addition, the lists can dynamically change, for example, by repeating the above processing (in whole or part) as time progresses and the user creates/modifies/deletes documents.
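Forming the word list for one group amounts to a set union followed by stop word removal, as in this minimal sketch. The names `word_list_for_group`, `bags` and the sample contents are illustrative assumptions, and alphabetical order is just one of the permitted orderings.

```python
def word_list_for_group(group, bags, stop_words=frozenset()):
    """Union of the word sets of every document in the group, minus stop words."""
    words = set()
    for doc in group:
        words.update(bags[doc])   # only the distinct words matter here
    return sorted(words - set(stop_words))


bags = {
    "d1": {"apple": 2, "cider": 1, "orange": 1},
    "d3": {"apple": 2, "cider": 1, "the": 3},   # hypothetical document
}
word_list = word_list_for_group({"d1", "d3"}, bags, stop_words={"the"})
# word_list -> ['apple', 'cider', 'orange']
```

Re-running this function whenever documents are created, modified or deleted is one simple way to obtain the dynamic behavior described above.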

As a result, the lists created as described above are essentially lists of all words associated with each other i.e. once the lists have been computed, the associations mentioned earlier have been computed as well. By way of simple example with respect to the two documents d1 and d2 introduced above, if a list contains the words {apple, cider, orange} then it can be assumed that these words are associated with each other. The word “paper” from d2 above is not on this list and so is not associated with “apple,” “cider” or “orange” through this list, but may be associated with “apple” through another list “d3” that contains {apple, orange, knife, paper, plate}. It may also happen that there is another list “d4”, based on another group of documents, that contains {orange, paper, peel} in which case “orange” and “paper” would be associated with each other through that list as well as through “d3”.

Thus, based upon the above, if a user were to type “apple” followed by the characters “ci,” then the program instructions implementing this aspect could check the lists, find that “apple” and “cider” are associated through a list, and thus offer “cider” to the user as a possible completion of the incomplete word. Similarly, if a user were to type “orange” followed by “p”, then “paper” and “plate” would be related to “orange” through list “d3” and “paper” and “peel” through “d4”. Depending upon the next letter entered, the “contextually related” contents of one of the lists would be excluded or the one word in common, “paper,” would be provided as the “contextually related” completion word.
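The lookup just described can be sketched as follows, using the three example lists. The function name `suggest`, the `min_chars` parameter (standing in for the threshold number of characters), and the rule that a list is relevant when it contains any previously typed word are assumptions of this sketch.

```python
def suggest(context_words, prefix, word_lists, min_chars=1):
    """Offer completions for `prefix` drawn from lists that share the context."""
    if len(prefix) < min_chars:
        return []   # wait until the threshold number of characters is typed
    suggestions = []
    for words in word_lists:
        # A list is relevant if it contains any previously typed word.
        if any(c in words for c in context_words):
            for w in sorted(words):
                if (w.startswith(prefix)
                        and w not in context_words
                        and w not in suggestions):
                    suggestions.append(w)
    return suggestions


word_lists = [
    {"apple", "cider", "orange"},
    {"apple", "orange", "knife", "paper", "plate"},   # list "d3"
    {"orange", "paper", "peel"},                      # list "d4"
]

after_apple = suggest(["apple"], "ci", word_lists)    # ['cider']
after_orange = suggest(["orange"], "p", word_lists)   # ['paper', 'plate', 'peel']
narrowed = suggest(["orange"], "pa", word_lists)      # ['paper']
```

A further refinement, not shown, would rank a word found in several relevant lists (here “paper”) ahead of words found in only one.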

In the first of the immediately preceding examples, a single word “apple” is used as the basis for choosing “cider” as the possible completion. Whether a single word, multiple words, sentences or even longer segments of text are used as the basis for choosing the possible word completion is dependent on the particular implementation. For simple devices, limitations on memory and processing resources will likely dictate that no more than a few words of prior user input be used as the basis for choosing the word completion. In other implementations this could be a user selectable parameter, a preset parameter in the device or could be adjusted dynamically by the program.

At this point it should be understood that document similarity is based on what words, if any, are shared by each of the pertinent documents and in what frequencies when similar documents are grouped and used to form lists. Thus, by grouping “similar” documents, an assumption is made from the similarity analysis that words that belong to the same broad subject matter are being collected—i.e. “similar” documents are necessarily likely to contain words that pertain to the same subject matter. Hence, the derived associations are likely to be more relevant to a user who is typing a document on that subject than other alphabetically similar or most recently used, but non-associated, words.

While the formation of groups is based on the calculated similarity value between pairs of documents, the details of how the similarity value is used can vary depending on the particular implementation of the invention. For example, as defined above, all document members of a group must have a similarity value above the set threshold value when compared to some portion of the other members of the group. However, that “portion” can be more or less restrictive. The most restrictive definition requires the portion be all members of a group and leads to the largest number of groups and to word lists that tend to be shorter but in which the contextual associations among words tend to be stronger. The least restrictive definition of a group is that any one document of the group need only have a similarity value above the threshold value when compared with any one other member of the group. Although it will form the fewest groups, this definition will, for practical purposes, never be used. In practice, if a pre-set threshold is specified, it will likely be set so as to have a reasonable number of groups while having meaningfully usable relationship lists. As currently contemplated, an initial threshold within the range of 50% to 90% would be used as a reasonable empirical starting threshold. Notwithstanding the foregoing, the threshold can be initially set to a value outside that range or can, in some implementations, ultimately end up outside that range either through evolution or affirmative changing of the setting. Regardless of how restrictive the definition of a group is, whenever a new document is found to have a similarity value that is below the set threshold when compared to all of the existing documents on the device, that document becomes the first member of a new group.

When less restrictive definitions of a group are used, groups, and their associated word lists, become larger and the strength of the contextual associations between words in the list becomes weaker. Thus, the threshold will clearly be a parameter that, even with its initial value, will vary with the particular implementation of the invention.

It will now be apparent that the threshold value for similarity plays an important role in the formation of groups and thus the lists of contextually associated words. The ability to distinguish between closely related documents will also increase as the threshold value is set higher. Therefore a high threshold value may be appropriate for an attorney who specializes in real estate matters and who creates documents, such as sales contracts or leases, which have a high degree of similarity, while a low threshold value may be appropriate for an editorial writer who produces documents on a wide variety of general interest topics.

The threshold value may be set by the designer of the device depending on the intended use, selectable by the user or set dynamically by the program implementing the invention. One alternative implementation method which, in effect, dynamically sets the appropriate threshold value involves specifying the maximum number of groups that can be created. By doing so, the system will be required to start at a relatively high threshold value and, if the resulting number of groups exceeds the specified maximum, it will adjust the threshold downward until the maximum number is not exceeded. If no more than five groups are specified then the threshold value will be lower than if no more than 20 groups are allowed. The program implementing the invention can be further optionally configured to, as documents are added/changed/deleted, monitor the number of groups and adjust the similarity threshold value so that the number of groups is continually within the specified limit. Additionally, the limit can be made variable, for example based upon the number of documents in the system. In this manner, once a greater threshold number of documents is exceeded, a new maximum can become effective so as to allow for more groups and maintain a meaningful granularity.
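
By way of non-limiting illustration, the group-limit approach can be sketched as follows. The sketch assumes that a pairwise similarity matrix `sim` is already available, uses the most restrictive group definition (a document must be similar to every member of a group), and lowers the threshold in fixed steps. The function names, the step size and the sample values are hypothetical and are not taken from the figures.

```python
def group_documents(sim, threshold):
    """Greedy grouping: a document joins the first group in which it is at
    least `threshold` similar to every member; otherwise it starts a new group."""
    groups = []
    for doc in range(len(sim)):
        for group in groups:
            if all(sim[doc][other] >= threshold for other in group):
                group.append(doc)
                break
        else:
            groups.append([doc])
    return groups


def fit_threshold(sim, max_groups, start=0.9, step=0.05, floor=0.0):
    """Start high and lower the threshold until at most max_groups remain."""
    threshold = start
    groups = group_documents(sim, threshold)
    while len(groups) > max_groups and threshold - step >= floor:
        threshold = round(threshold - step, 10)  # avoid float drift
        groups = group_documents(sim, threshold)
    return threshold, groups


# Hypothetical similarity values for four documents.
sim = [
    [1.00, 0.95, 0.62, 0.10],
    [0.95, 1.00, 0.58, 0.10],
    [0.62, 0.58, 1.00, 0.20],
    [0.10, 0.10, 0.20, 1.00],
]
threshold, groups = fit_threshold(sim, max_groups=2)
print(threshold, groups)
```

With a limit of three groups the starting threshold already suffices, whereas a limit of two groups forces the threshold lower, consistent with the observation above that a smaller maximum yields a lower threshold.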

Another optional additional method that may be used by the program implementing the invention is to dynamically set the similarity threshold value by monitoring the performance of the word completion system. For optimal performance of an automatic word completion system, the system must strike a balance between rarely offering a word completion that is always appropriate and always offering a word completion that is often inappropriate. By monitoring how often the system offers a possible word completion to the user and how often the user accepts the offered word completion, the program can determine if the strength of the contextual associations is too high or too low. If the user is frequently having to discard suggested word completions, the program can increase the strength of the contextual associations by raising the threshold value or, as discussed above, making the definition of a group more restrictive, or both.
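
By way of non-limiting illustration, one way such monitoring might be sketched is shown below. The acceptance-rate target, the step size and the upper bound are hypothetical parameters chosen only for the example; the text above does not prescribe particular values.

```python
def adjust_threshold(threshold, offered, accepted,
                     min_accept_rate=0.5, step=0.05, max_threshold=0.95):
    """Raise the similarity threshold when too many offers are discarded."""
    if offered == 0:
        return threshold  # no completions offered yet, nothing to learn from
    accept_rate = accepted / offered
    if accept_rate < min_accept_rate:
        # Frequent rejections: strengthen the contextual associations.
        return min(round(threshold + step, 2), max_threshold)
    return threshold


print(adjust_threshold(0.70, offered=20, accepted=4))
print(adjust_threshold(0.70, offered=20, accepted=15))
```

The sketch only raises the threshold, as described above; an implementation could equally make the group definition more restrictive instead of, or in addition to, changing the threshold.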

Yet another optional use of the similarity threshold value is to rank possible word completions so that the most likely word completion is offered to the user first. The ranking is accomplished by computing the word associations using different similarity threshold values. In implementations of the invention where the user is offered multiple possible word completions, words that are contextually associated with the partial input based on use of a high threshold value would be offered at the top of the list, while those words that are only contextually associated based on the use of a low threshold value, would be offered at the bottom of the list.
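
By way of non-limiting illustration, the ranking by threshold can be sketched as follows, assuming the associations for the current context have already been computed at several thresholds. The mapping `associations_by_threshold` and its contents are hypothetical sample data.

```python
def rank_by_threshold(associations_by_threshold, prefix):
    """Order candidates by the highest threshold at which they are still
    associated with the context; ties are broken alphabetically."""
    best = {}
    for threshold, words in associations_by_threshold.items():
        for word in words:
            if word.startswith(prefix):
                best[word] = max(best.get(word, threshold), threshold)
    return sorted(best, key=lambda w: (-best[w], w))


# Hypothetical associations with the word already typed.
associations_by_threshold = {
    0.9: {"summary"},
    0.5: {"summary", "sugar", "supply"},
}
print(rank_by_threshold(associations_by_threshold, "su"))
```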

To further illustrate how different documents may be grouped, consider a collection of four documents: a draft of the Declaration of Independence, the signed version of the Declaration of Independence, the United States Constitution, and an article reviewing the foundation of United States constitutional law. In the case of a user who has documents covering widely different topics such as the history of baseball, art and literature, a relatively low similarity threshold may be set so that there are only four groups of documents. The Declaration of Independence, Constitution, and constitutional law article would likely all be in one group as they are clearly related documents but would not be related to documents on baseball, art or literature. In the case of another user who happens to be an attorney and who has only legal documents on his device, the similarity threshold value would have to be set higher so that different classes of legal documents could be distinguished. If the threshold value were set too low, then the fact that words such as “liberty” are present in all four of the documents in our example collection would be enough for all four documents to be placed in the same group. With a higher threshold value, words like “creator,” which only appears in the Declaration of Independence, and “Senate,” which only appears in the Constitution and the constitutional law article, would result in these documents being placed in two different groups, one on the Declaration of Independence and one on the Constitution.

A second example method for computing contextual associations between words, suitable for use with the present invention, is outlined in the flowcharts of FIGS. 4 a and 4 b. As in the first example method, the second method begins by creating a list of pertinent documents on the device (400) and creating a list of unique words from these documents (410). Again, an optional step is the removal of stop words from the word list (420). For all possible pairs of words from the list, the frequency of co-occurrence within a unit of text is counted (430) and this result entered into a matrix (440). The particular “unit” that is chosen will be a parameter to be decided based upon various factors present for the particular implementation. Generally, if the documents being considered are fairly long (such as a thesis, book or report), then a paragraph is typically considered as the appropriate unit; if the documents being considered are fairly short (e.g. like an SMS message on a cell phone or an e-mail), the whole document may be considered as a unit. Other “unit” options are, for example, the sentence, sections of a document, chapters, pages, columns, etc. The unit parameter may be initially fixed, permanently fixed, modifiable or dynamic. In fact, in some implementations, the unit may even be allowed to change based on the length of the document and the portion involved. For example, for patent documents, the appropriate “unit” for the abstract might be the paragraph, for the claims it might be the line or sentence, and for the detailed description it might be, for example, a section, column, paragraph or sentence.
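
By way of non-limiting illustration, the counting of co-occurrences (430) and the filling of the matrix (440) can be sketched as follows. The sketch assumes that each unit of text is supplied as a string, that words are split on whitespace without punctuation handling, and that a pair is counted once per unit in which both words appear. The function name and sample units are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def cooccurrence_matrix(units, stop_words=frozenset()):
    """Count, for every pair of words, the number of units containing both."""
    matrix = defaultdict(lambda: defaultdict(int))
    for unit in units:
        # Unique, lower-cased words of the unit with stop words removed.
        words = sorted({w.lower() for w in unit.split()} - set(stop_words))
        for a, b in combinations(words, 2):
            matrix[a][b] += 1
            matrix[b][a] += 1  # keep the matrix symmetric
    return matrix


units = [
    "two spoons of sugar",
    "finance summary",
    "summary of finance report",
]
matrix = cooccurrence_matrix(units, stop_words={"of"})
print(matrix["finance"]["summary"])
```

Because the matrix is symmetric, an implementation constrained by memory could store only one triangular half.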

The rows and columns of the matrix are lined with words from the word list. As before, the interchange of rows and columns changes nothing with respect to the invention. The matrix containing the co-occurrence data is used to identify word pairs that are contextually associated, based on their frequency of co-occurrence. The steps of the method to this point can be performed independently of user text entry or, in implementations where the dynamic aspect of the invention is realized, these steps can be performed simultaneously with user text entry so that the contextual word associations are updated as the user enters new text into the device.

Once the contextual word associations have been determined, the device accepts character input from the user until a threshold number of characters are entered (460). Words in the matrix that match the entered characters are identified (470) and the identified words that are contextually related are chosen (480) and offered for selection by the user to complete the character entry (490).

Note that, as described above, the matrices in FIGS. 3 and 5 have been filled with direct co-occurrence information. However, this is not required because there are many other variants of this approach that can be used to the same, better, or in some cases less, effect (just as there are many different ways of measuring document similarity) depending upon the particular implementation and documents involved. Thus, the raw information need not be used as is but may, for example, be normalized in some way, ranked, and so on.

Optionally, once the co-occurrence matrix is prepared, it can be reorganized so it is ordered, for example, according to which co-occurrences are higher relative to each letter of the alphabet.

In an example system implementing this method of the invention, and in reference to the matrix (500) in FIG. 5, when a user types “finance” followed by the characters “su,” the system would check the co-occurrence matrix in the row corresponding to “finance” (502) to see which words starting with “su” have the highest co-occurrence score. Thus, if “summary” has a higher score than “sugar,” the system would offer “summary” rather than “sugar” as the appropriate completion word.

In yet another implementation, variants of the above two methods can be run together as a cross-check or to get more suggestions. However, limitations of speed and memory may render this dual method approach impossible, impractical or overly costly for some implementations.

It is to be understood that, while the two approaches have been described in terms of bags and sets, because they are conceptually the most appropriate mathematical description, the actual implementations of these data structures can vary (e.g. they can be lists rather than bags or sets), the particular data structure used to accomplish the result being unimportant to the invention.

Optionally, in some implementations the words in the lists are further processed to reduce them to stemmed form. That is, they may not be the full words but rather words with their endings suitably modified, for example, by one of the many available stemming algorithms. This would enable the program to handle words like “finance,” “financing” and “financial” or “remain,” “remainder” and “remaining” in a uniform manner so that the user will be offered the same suggestions independent of the particular variant of the contextual words he has typed. However, stemming is optional. There may be advantages in not stemming words in some implementations, because the system can differentiate among them if the words are not stemmed and thereby offer one variant on one occasion and another on another occasion.

The number of initial characters that are needed in order for the system to make a suggestion to complete the word is also a parameter that can be preset or set by the user depending on the implementation. As with completion algorithms of the prior art, in most implementations, this is 2 or 3 characters. However, word suggestions may start from the very first character. A second parameter that needs to be defined for a particular implementation is how long the full word needs to be in order for the system to even make a suggestion. As contemplated, the minimum word length is 5 characters or strokes although minimum lengths of 2 or 3 characters (or strokes) can be used. That is, if the characters “su” were typed, the system may not suggest “sure” even if it happens to be the best association because it is too short a word. On the other hand, “summary” and “sugar” are acceptable candidates as they meet the length requirement. This minimum length is also a parameter that will vary with different implementations and possibly different users. Again, the minimum length may be as short as one character so that any and all suggestions (up to the “n” best suggestions) may be offered. Once the associations have been computed the resulting information is used to suggest a word completion.
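
By way of non-limiting illustration, the two parameters just described can be sketched as a simple filter. The sketch assumes that the candidate words are already ordered best-first by whichever association method is in use; the function name and the default value for the “n” best suggestions are hypothetical.

```python
def eligible_completions(prefix, candidates,
                         min_prefix_len=2, min_word_len=5, n_best=3):
    """Apply the minimum typed-prefix length and minimum word length."""
    if len(prefix) < min_prefix_len:
        return []  # too few characters typed to make a suggestion
    matches = [w for w in candidates
               if w.startswith(prefix) and len(w) >= min_word_len]
    return matches[:n_best]


candidates = ["sugar", "summary", "sure"]  # assumed best-first order
print(eligible_completions("su", candidates))
print(eligible_completions("su", candidates, min_word_len=1))
```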

Having described component aspects in simplified and theoretical form, particular example implementations are now discussed with reference to the example word lists of FIG. 6 (600).

Assume groups of documents to be used have been created, words that appear in each group collected and alphabetized to form the series of lists of FIG. 6.

Example 1

In one simplified example implementation, a user types “finance” followed by the characters “su.” The system checks all the word lists for the word “finance.” It would find that it just occurs in word list #1 (602). Next, it checks all words in that list (or lists, if there is more than one where “finance” occurs) that begin with “su” and it finds the word “summary.” As a result, the system offers to complete the user's word with “summary.” The user can opt to accept that suggestion or not by the appropriate action, in this case pressing an appropriate key (e.g. the “Enter” key). Note, that as a result of the above processing, “sugar” is not on this list and thus it is judged not to be a contextually relevant suggestion.

Example 2

In another simplified example implementation based upon FIG. 6, the user types “two spoons of” and then the characters “su.” Again, the system looks for the words “two” and “spoon” (the plural ending can be assumed to be stemmed in real time if stemming is used) on the word lists. Because, in this example, the word “of” is on the stop word list it is ignored. The system finds “two” in word lists 2 (604) and 3 (606) and finds “spoon” in word list 2 (604). As a result, it looks for words beginning with “su” in all common lists first, i.e. in this example, word list 2 (604). It finds “sugar” in word list 2 (604) and offers it as a suggestion to the user, for acceptance or rejection. In this case, although word list 3 (606) contains the word “summary” which could have been a possible suggestion, it was eliminated because “summary” is associated with “two” through word list 3 (606) but it is not associated with “spoon” on any list.

Example 3

In a further simplified example implementation based upon FIG. 6, a user types in “one” followed by the characters “ch.” Now “one” occurs in word lists 2 (604) and 3 (606). There is no other information to find a common list or to eliminate any list containing “one” as was done with “summary” in the previous example. In this case, both lists are considered and searched for words starting with “ch.” There are two: “chili” in list 2 (604) and “chapter” in list 3 (606). Depending on the implementation, both suggestions are offered to the user for acceptance or rejection or the system can wait until the user types a third character. Note that, if the former occurs, optionally, the system can operate to ignore further character entry for that word or re-present a selection based upon the next character entered. This can occur only once or interactively, the former allowing for entry of words not in any list without annoying prompts, the latter allowing for discerning among multiple words with similar starting letters like “heat,” “heavy,” “heather,” “heaven,” “hearing” and “health” which all share an initial “hea” or different words having a common root like “finance,” “financing,” “financial,” and “financially.” In the example, a third letter is waited for, so when the characters “cha” are typed in, “chili” is eliminated as a possible completion and just “chapter” is offered.
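
By way of non-limiting illustration, Examples 1 through 3 can be sketched as follows. The word lists contain only the words that the examples mention as being on each list of FIG. 6, the context words are assumed to be already stemmed (“spoon” rather than “spoons”), and falling back to every list containing a context word when no list is common is only one of the available options. All names are hypothetical.

```python
STOP_WORDS = {"of"}

# Only the words that the examples place on each list of FIG. 6.
WORD_LISTS = {
    1: {"finance", "summary"},
    2: {"one", "two", "spoon", "sugar", "chili"},
    3: {"one", "two", "summary", "chapter"},
}


def suggest(context_words, prefix, word_lists=WORD_LISTS, stop_words=STOP_WORDS):
    """Offer completions from the lists shared by all context words."""
    context = [w for w in context_words if w not in stop_words]
    # For each context word, the identifiers of the lists that contain it.
    memberships = [{i for i, words in word_lists.items() if w in words}
                   for w in context]
    memberships = [m for m in memberships if m]
    if not memberships:
        return []  # the typed words are on no list: offer nothing
    common = set.intersection(*memberships)
    lists_to_search = common if common else set.union(*memberships)
    found = set()
    for i in lists_to_search:
        found.update(w for w in word_lists[i]
                     if w.startswith(prefix) and w not in context)
    return sorted(found)


print(suggest(["finance"], "su"))             # Example 1
print(suggest(["two", "spoon", "of"], "su"))  # Example 2
print(suggest(["one"], "ch"))                 # Example 3, two characters
print(suggest(["one"], "cha"))                # Example 3, three characters
```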

Even when there are multiple words that have been typed as in Example 2, if there are no common lists or if there is more than one list in common, similar options can be employed, i.e. either all of the suggestions are offered or wait until additional characters are entered. If this is done interactively, eventually the system will most likely eliminate possible options until just one or a few can be offered to the user. As with other aspects described herein this can be preset or set/modified by the user, depending upon the particular implementation involved. Some users find it annoying to have too many suggestions while others find it acceptable. Of course, it is also possible that the words already typed do not belong to any word lists in which case, in one example implementation, no suggestions are offered. Alternatively, in such cases, a prior art approach such as providing a word from a standard online dictionary or a word from a most recently used list can serve as a fallback. These again are implementation-specific decisions.

While the above simplified examples reflect the kinds of situations that can arise, by no means do they exhaust all the possible permutations and combinations of situations that might occur in practice in a more commercially suitable implementation. For such systems additional optional extensions can be employed to make the approach faster, more robust, or otherwise more suitable.

One such optional extension uses frequency information (i.e. the number of times the word occurs in the set of documents under consideration) stored with or otherwise associated with the words in word lists so that when there is more than one choice available to offer to a user, frequency ranking can be used to specify the offering. This is now described with respect to a simplified example matrix presumed generated from a group of documents using a technique described above illustrated in FIG. 5.

The matrix of FIG. 3 is similar to that of FIG. 5 except that only half of the matrix is filled out because the information in the lower diagonal half would be symmetric to the information in the upper half and it was created from a slightly different group of documents. In the example of FIG. 5, the diagonal entries are left blank but it should be understood that they can alternatively optionally be filled with plain frequency information (e.g. the number of times a word occurs in a document or on the device).

Example 4

In an example system incorporating this simplified example implementation which uses the co-occurrence matrix of FIG. 5, a user types “finance” followed by the characters “su.” The system looks up “finance” in the rows (502) and finds the co-occurrence scores of the words starting with “su” in the columns. These are “summary” and “sugar” and their scores are 12 and 1 respectively (identified at the intersection of the two in the matrix (500)). Since “summary” has the higher score, it is offered as a suggestion to the user.

Example 5

In another example system incorporating this simplified example co-occurrence matrix, the user types “two spoons of” and then the characters “su.” Now the word “two” (508) has scores of 3 and 7 respectively for “summary” and “sugar” and “spoon” (516) has scores of 0 and 14 for “summary” and “sugar” respectively. Using any of a number of known possible ways for combining scores to produce a unique or at least a ranked outcome, a preferred or selected result is obtained and the corresponding word presented. Clearly, in this example, any reasonable formula (e.g. the sum of the scores) should produce “sugar” as the preferred outcome and offer it as a possible completion to the user or alternatively offer both suggestions with “sugar” optionally ranked first.

Example 6

In a further simplified example implementation involving the co-occurrence matrix of FIG. 5, presume a user has typed “one” followed by “ch.” The two possible completions as seen from the matrix are “chapter” (512) with a score of 6 and “chili” (514) with a score of 8. In this implementation, based upon the scores, the higher score may be chosen or, since the scores are relatively close to each other, a ranked list may be offered. If the two scores had been identical, the order of presentation could be based upon, for example, alphabetical order, most recent use or some other such variant. Alternatively, the system can be set up, for cases where the scores are the same, to wait, without offering any suggestion, until the user has entered one or more additional characters (e.g. “cha”) in which case, in this example, “chapter” would be the unambiguous choice.
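
By way of non-limiting illustration, Examples 4 through 6 can be sketched as follows, using only the co-occurrence scores quoted in those examples and a simple sum as the combining formula. The dictionary layout and function name are hypothetical, and the alphabetical tie-break is just one of the options mentioned.

```python
# Co-occurrence scores quoted in Examples 4 through 6.
COOC = {
    "finance": {"summary": 12, "sugar": 1},
    "two": {"summary": 3, "sugar": 7},
    "spoon": {"summary": 0, "sugar": 14},
    "one": {"chapter": 6, "chili": 8},
}


def rank_completions(context_words, prefix, cooc=COOC):
    """Sum each candidate's scores over the context words, best first."""
    totals = {}
    for word in context_words:
        for candidate, score in cooc.get(word, {}).items():
            if candidate.startswith(prefix):
                totals[candidate] = totals.get(candidate, 0) + score
    # Highest total first; equal totals fall back to alphabetical order.
    return sorted(totals.items(), key=lambda item: (-item[1], item[0]))


print(rank_completions(["finance"], "su"))       # Example 4
print(rank_completions(["two", "spoon"], "su"))  # Example 5
print(rank_completions(["one"], "ch"))           # Example 6
```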

In examining the examples above, it is clear that there will often be cases where there will be more than one possible completion to offer the user. In some implementations, the system will not offer the user a possible completion until the user has entered sufficient characters so that an unambiguous choice can be made. In other implementations, the system will rank the possible completions to be offered to the user based on the contextual associations that have already been computed. For example, words with a higher co-occurrence score, words from more frequently or recently accessed lists, or words from lists that have the greatest similarity, can be offered first. Still further, in some implementations, the rankings can be dynamically set in that they change as the user enters more characters (whether in that word or during the course of document creation).

One representative example of how ranking of multiple possible completions can be achieved is now described. Since multiple possible completions are likely to be the norm rather than the exception in many commercial implementations, a ranking method is sometimes desirable.

Assume two words have been typed in a document and a third word is being entered. Assume further that the co-occurrence score of each prior word with possible completions based on the characters already entered is available from the co-occurrence matrix. Then the “relationship score” is defined as a monotonic function of these co-occurrence scores—for example, a simple sum of the co-occurrence scores of candidate completions with the prior words.

Next, a function “f” is defined and stands for the “overall” score as follows:

f=Overall score=if relationship score>1

and if frequency score>1

then overall score=relationship score×frequency score

else overall score=relationship score+frequency score

Here, for example, relationship score=the sum of the co-occurrence scores of the candidate completions for the word being entered with the prior words in the text. Optionally, in some implementations, a limited part of the context will be used to make the computations practical, e.g. the prior ten words may be used.

Given the example above, if the two words of the document have already been typed, and a third word is being entered, in general, there will be several candidate completions. If one presumes that the candidate completion is some word “x”, then the co-occurrence matrix is examined for the entry corresponding to x to find the co-occurrence score for “x” with the first two words of the message. In this example, presume that the scores are respectively 8 and 3. Thus, the relationship score will be 8+3=11. Of course, in this example, the complete number of words can be used. However, with longer documents, it may be desirable as noted above, to limit the number to some subset, for example, ten words in one example implementation.

Depending upon the particular implementation, the frequency score will, for example, be given by the number of times a word has been used by a user in all the documents on the device. Alternatively, in some implementations, a standard frequency of the word, for example as obtained from a corpus or standard usage, can be used instead of the user's own document-derived frequency. In yet other implementations, a function of the two types of frequency, the user based frequency and the corpus based frequency (e.g. the sum of the two) can also be used.

Then ranking is done, for example, by ordering according to increasing or decreasing overall score or by increasing or decreasing values of “f”.
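
By way of non-limiting illustration, the function “f” can be sketched as follows, taking the relationship score as the simple sum of the co-occurrence scores. The function name and the frequency scores used in the calls are hypothetical; only the co-occurrence scores 8 and 3 come from the example above.

```python
def overall_score(cooccurrence_scores, frequency_score):
    """Overall score "f": product when both scores exceed 1, otherwise sum."""
    relationship_score = sum(cooccurrence_scores)
    if relationship_score > 1 and frequency_score > 1:
        return relationship_score * frequency_score
    return relationship_score + frequency_score


# Candidate "x" from the example: co-occurrence scores 8 and 3.
print(overall_score([8, 3], frequency_score=4))
print(overall_score([8, 3], frequency_score=1))
```

With the relationship score of 11 from the example, a frequency score above 1 leads to the product branch, whereas a frequency score of exactly 1 falls through to the sum.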

In some cases, two or more candidate completions can have a tie in their overall scores. In such cases, one of many ways to deal with this tie circumstance can be to order the completions by the following ranking hierarchy: exact match>stop word>related word>unrelated word, where “exact match” means the length of the completion=length of characters entered so far; “related word” means word with non-zero co-occurrence score; and “unrelated word” means word with a co-occurrence score of 0.
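
By way of non-limiting illustration, the tie-breaking hierarchy can be sketched as a secondary sort key applied after the overall score. The tuple layout, the final alphabetical fallback and the sample candidates are hypothetical.

```python
def rank_with_ties(candidates, typed, stop_words):
    """candidates: list of (word, overall_score, cooccurrence_score).
    Ties in overall score are ordered exact match, stop word, related
    word, unrelated word."""
    def tie_class(word, cooccurrence_score):
        if len(word) == len(typed):
            return 0  # exact match
        if word in stop_words:
            return 1  # stop word
        if cooccurrence_score > 0:
            return 2  # related word
        return 3      # unrelated word

    ordered = sorted(
        candidates,
        key=lambda c: (-c[1], tie_class(c[0], c[2]), c[0]),
    )
    return [c[0] for c in ordered]


# Hypothetical candidates that all tie on overall score.
tied = [("thesis", 5, 0), ("theory", 5, 2), ("there", 5, 0), ("the", 5, 0)]
print(rank_with_ties(tied, typed="the", stop_words={"the", "there"}))
```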

Of course, it will be appreciated that such a function will have multiple possible variants, only one of which has been described for concreteness, others being straightforwardly obtainable from an understanding of the prior example. Similarly, other ranking schemes can be used that are based on such a function.

A few such variants are now described. First, as mentioned above, any monotonic function (like a sum) of the co-occurrence scores may be used to arrive at a relationship score. Similarly, any monotonic function of the relationship score and frequency score may be used (like the combination of product and sum used to define f above) to arrive at an overall score. Finally, when there is a tie, alternative rankings of candidate completions are possible on the basis of the cases described above (e.g. exact match, stop word, related word, unrelated word etc.) and other similar cases.

More general computational schemes may also be employed. For example, the functions described above need not be monotonic, although the results with such general functions may have limited practical value. Finally, a single function instead of two separate functions may also be considered, taking as arguments the co-occurrence scores of prior words and the various types of frequency scores described.

The ranking methods described above pertain to the co-occurrence method of computing associations between words. When the vector space and document grouping method is considered, a similar principle can be applied. Instead of the relationship score being a function of co-occurrence scores, the relationship score can be defined as some suitable function of the candidate completion's group membership: does it belong to multiple groups?, what is its frequency within each group?, and even some function of its various frequencies in multiple groups are the kinds of inputs that may be considered.

In addition to the contextual and automatic aspects which have been illustrated so far, there is also an optional dynamic aspect that can be included. Where such an aspect is used, as the user continues to create more documents of different types, the system re-computes the associations of words, for example, through the methods described above. This allows the system to keep evolving and improving as it has more and/or new data to work with. In the case where a co-occurrence matrix is created by pooling all of the unique words from all of the pertinent documents, the option of updating the co-occurrence matrix in real time, even as the user is creating a new document, is available.

Notably, the above examples have implicitly dealt with unstructured documents (like emails, SMS text messages, word processed documents, etc.) and not with structured documents like forms, letters and spreadsheets. Advantageously, such documents can also be included in the class of documents useable with the invention. Moreover, the structure itself can additionally provide usable contextual clues. For example, in a letter, the closing usually contains words like “Very Truly Yours” or “Sincerely;” in such a situation, as a special case related to that used in the prior art, the system could suggest special words like “sincerely” from a special list because of the location in the document. Similarly, certain words may always appear capitalized (e.g. like the user's name and address) and these special words can once again be handled in a special way. In some implementations the word lists themselves can be further extended so as to contain, as needed or desired, a separate field for storing special information about the word such as capitalization, formatting (bold, italics, etc.), hyphenation, inclusion of special characters, accents on certain characters etc. Since the storing of such information, per se, is known and in use in, for example, certain word processors and it is readily implementable based upon that knowledge, for brevity it is not described herein.

As to treatment of numbers, in any given document there may be special numbers like the current date and year that can be offered as completions in the context of a letter or such numbers can be treated as “stop words.”

Advantageously, the above techniques can further be extended to allow for phrase completion. One way of doing this is to find phrases and so-called collocations by using standard frequency-based techniques on a phrase basis, as opposed to a word basis, and then apply the methods described above to these phrases. In addition, an advantage of the techniques disclosed herein is that they can be combined with existing prior art techniques like Most Recently Used (“MRU”) techniques. For example, one way to use a combination of the present invention and the prior art techniques is to narrow down the possible suggestion list using one of the above methods and then choose one or rank the remainder by an MRU technique. Another is to use regular dictionaries in the absence of any suggestion the system might make. Finally, the order of words on a word list generated by the vector space method can be allowed to change dynamically based on the use the user makes of suggestions he receives.

Note that when a user acquires a device like a computer, cell phone or personal digital assistant for the first time, there are not going to be any user-created documents on the device to which the methods described above can be applied. To allow for this, until there is some history of document creation, one or more of several options exist. One of the conventional prior art techniques could be used until a sufficient number of documents exist. Alternatively, manually created word lists or co-occurrence matrices of a generic type, i.e. that a multiplicity of users are likely to find useful, could be used; or more specialized word lists based on different types of vocabularies or uses (e.g. for bankers, lawyers, businessmen, doctors, domestic uses etc.) could be created by the provider or others. In general, such approaches would be temporary until the user has his or her own document-based history at which point the existing lists or co-occurrence matrices can either be augmented, modified or completely dropped in favor of new word lists and matrices generated by the user's own documents. These same approaches can also be used in the situation where there are documents on the device but the user is starting a new document and there are insufficient words entered to establish associations.

An optional extension of the methods, which uses the text being generated as dynamic data, will now be described. This method can be used, in some implementations, to augment any of the approaches described above.

Consider a user who is typing a somewhat long document (for example, more than three paragraphs, although this is again a variable parameter that can differ in particular implementations). Suppose they have already typed a few pages (say two pages) and now they are on the third page. The data entered so far can be used in either of two ways. In the first, as described above, it provides the context for the next word being typed. In the second, it serves as dynamic data: the first two pages are used as data from which to create an auxiliary co-occurrence matrix of the type discussed above dynamically (i.e. “on the fly”). This matrix can then be used, as before, in conjunction with either the word lists based on document groups of the first method described above or with the co-occurrence matrix of the second method described above. This approach can provide additional advantages because text in the document the user is currently working on will likely be more relevant to the word completions that can be offered to the user than the words in the lists. This dynamically created matrix of co-occurrences is updated either continually on the fly or periodically as the user continues to type, for example after each paragraph or page is completed. Alternatively, instead of using a co-occurrence matrix, a plain frequency matrix containing the number of occurrences of each word in the document being typed could be used.
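A minimal sketch of such a dynamically updated co-occurrence structure is given below, purely as an illustration. It assumes that two words co-occur when they appear in the same paragraph and that the matrix is updated after each completed paragraph; the class name `DynamicCooccurrence`, its method names, the tokenization rule and the sample text are all hypothetical choices for this example.

```python
import re
from collections import defaultdict
from itertools import combinations


class DynamicCooccurrence:
    """Auxiliary co-occurrence counts built from the document being typed."""

    def __init__(self):
        # counts[a][b] = number of paragraphs in which words a and b both appear
        self.counts = defaultdict(lambda: defaultdict(int))

    def add_paragraph(self, text):
        """Update the counts once a paragraph has been completed."""
        words = sorted(set(re.findall(r"[a-z']+", text.lower())))
        for a, b in combinations(words, 2):
            self.counts[a][b] += 1
            self.counts[b][a] += 1

    def completions(self, context_word, prefix):
        """Words starting with prefix that co-occurred with context_word,
        most frequent first (ties broken alphabetically)."""
        row = self.counts.get(context_word.lower(), {})
        matches = [(w, n) for w, n in row.items() if w.startswith(prefix.lower())]
        matches.sort(key=lambda item: (-item[1], item[0]))
        return [w for w, _ in matches]


if __name__ == "__main__":
    matrix = DynamicCooccurrence()
    matrix.add_paragraph("The finance committee reviewed the stocks report.")
    matrix.add_paragraph("Finance analysts track stocks and statements.")
    print(matrix.completions("finance", "st"))
```

Replacing the pairwise counts with a single per-word counter would give the plain frequency alternative mentioned above.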

In instances where it is desirable that the results derived from this dynamically-created data be combined with results derived from the offline data, the intersection of the results can be taken and further, it can be optionally ranked in some way. Alternatively or in case of an empty intersection, the union can be taken or one set of results given greater priority.
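The combination just described might be sketched as follows. This is an illustrative example only; the function name `combine_results` is hypothetical, and the choice to give the dynamically derived results priority when falling back to the union is one assumed policy among those the description permits.

```python
def combine_results(dynamic, offline):
    """Combine completions from dynamic (current-document) data and offline data.

    Returns the intersection, in the order given by the dynamic results.
    If the intersection is empty, returns the union with the dynamic
    results listed first (an assumed priority policy).
    """
    offline_set = set(offline)
    common = [w for w in dynamic if w in offline_set]
    if common:
        return common

    # Empty intersection: fall back to the union, without duplicates.
    seen = set()
    merged = []
    for w in list(dynamic) + list(offline):
        if w not in seen:
            seen.add(w)
            merged.append(w)
    return merged


if __name__ == "__main__":
    print(combine_results(["stocks", "statement"], ["stake", "stocks"]))
    print(combine_results(["stocks"], ["stake"]))
```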

When a full keyboard is not available on the device (say a cell phone) there are currently methods (e.g. T9) which allow the user to enter, for example, the numbers 7 and 8 which correspond to the multiple groups of letters p, q, r, s and t, u, v respectively. A method like T9 offers the user an alphabetically ranked list of acceptable combinations (e.g. “pu” and “su” are acceptable, “pt” is not) based on dictionary entries. In such a situation, the computations described in the foregoing example involving any of the possible approaches would still occur, but not just to, for example, “su” but also to, for example, “pu.” Presumably, if the user were typing something like “finance st/pu” then, just as completions like “sugar” would not be acceptable, completions like “pushing” would also not be acceptable under the same logic. If it turned out that that completion was, in fact, relevant under any of the methods above then multiple possible completions would be offered and these would be dealt with as before (e.g. by a ranked list).
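The handling of ambiguous key entries can be illustrated with the following sketch, which is offered under stated assumptions rather than as a description of T9 itself. The key-to-letter mapping is the standard telephone keypad layout; the function names, the tiny sample dictionary, and the `is_relevant` callback (a stand-in for whichever contextual method described above is in use) are hypothetical.

```python
from itertools import product

# Standard telephone keypad letter groups.
KEYPAD = {
    "2": "abc", "3": "def", "4": "ghi", "5": "jkl",
    "6": "mno", "7": "pqrs", "8": "tuv", "9": "wxyz",
}


def acceptable_prefixes(keys, dictionary):
    """Letter combinations for the pressed keys that begin at least one
    dictionary word, in alphabetical order."""
    combos = ("".join(letters) for letters in product(*(KEYPAD[k] for k in keys)))
    return sorted(p for p in combos if any(w.startswith(p) for w in dictionary))


def completions_for_keys(keys, dictionary, is_relevant):
    """Apply the contextual relevance test to every acceptable prefix,
    not just one of them."""
    results = []
    for prefix in acceptable_prefixes(keys, dictionary):
        for word in dictionary:
            if word.startswith(prefix) and is_relevant(word):
                results.append(word)
    return results


if __name__ == "__main__":
    words = ["stocks", "sugar", "pushing", "supply", "purchase"]
    print(acceptable_prefixes("78", words))
    # Hypothetical relevance test standing in for a contextual method.
    print(completions_for_keys("78", words, lambda w: w in {"stocks", "supply"}))
```

Here the dictionary lookup first discards combinations such as “pt” that begin no word, and the contextual test is then applied across all surviving prefixes, so an irrelevant word is rejected regardless of which letter combination produced it.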

Finally, it is to be understood that different variants of the invention, including representative embodiments and extensions, have been presented to assist in understanding the invention. It should be understood that such implementations are not to be considered limitations on either the invention or equivalents except to the extent they are expressly in the claims. It should therefore be understood that, for the convenience of the reader, the above description has only focused on a representative sample of all possible embodiments, a sample that teaches the principles of the invention. The description has not attempted to exhaustively enumerate all possible permutations, combinations or variations of the invention, since others will necessarily arise out of the use of particular hardware or software or through specific types of applications in which the invention can be used. That alternate embodiments may not have been presented for a specific portion of the invention, or that further undescribed alternate embodiments may be available for a portion of the invention, is not to be considered a disclaimer of those alternate embodiments to the extent those undescribed embodiments also incorporate the minimum essential aspects of the invention, as claimed, or an equivalent thereof.

1. A method performed in a computer device having associated therewith a plurality of unstructured documents having words therein, the unstructured documents being stored in non-transient memory accessible to the computer device, the method comprising: accessing at least some of the plurality of unstructured documents; extracting from each document making up the at least some of the plurality of unstructured documents, a multiset of words; forming a matrix from the documents in which each word in the multiset is represented in a column and each document from which the words came is represented in a row; treating each document as a vector in a multidimensional Euclidean space with the number of dimensions being equal to one of the number of columns or number of rows in the matrix; uniquely pairing the unique documents making up the at least some of the plurality of unstructured documents; measuring the similarity between the pairs of unique documents making up the at least some of the plurality of unstructured documents as a cosine of the angle between vectors; comparing the cosines to a specified threshold to determine relatedness among the documents; and based upon the relatedness, when an input is received by the computer device representing a string of a threshold number of characters, the computer device will provide at least one word that would complete the character string. 